FinTech Fraud Detection — This notebook focuses on the Credit Card Fraud dataset


Step 1: Exploratory Data Analysis (EDA)

In this step, we will:

  • Understand the dataset
  • Check for missing values
  • Visualize distributions
  • Identify early indicators of fraudulent transactions
In [2]:
# Load libraries

!pip install plotly

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings

warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8')
sns.set_palette("coolwarm")
Requirement already satisfied: plotly in ./myenv/lib/python3.12/site-packages (6.3.1)
Requirement already satisfied: narwhals>=1.15.1 in ./myenv/lib/python3.12/site-packages (from plotly) (2.8.0)
Requirement already satisfied: packaging in ./myenv/lib/python3.12/site-packages (from plotly) (25.0)
In [3]:
# Load the Dataset - Adjust path if necessary

Credit = pd.read_csv("/mnt/c/1.MorganeCanada/Project-2-/Data/CreditCard_FraudDetection.csv")
Credit.head()
Out[3]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0

5 rows × 31 columns

In [4]:
# Basic Overview - We’ll inspect shape, column types, missing values, and a few summary statistics.

print("Shape:", Credit.shape)
print("\nInfo:")
print(Credit.info())
print("\nMissing values:", Credit.isnull().sum().sum())

Credit.describe().T.head(10)
Shape: (284807, 31)

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     284807 non-null  float64
 22  V22     284807 non-null  float64
 23  V23     284807 non-null  float64
 24  V24     284807 non-null  float64
 25  V25     284807 non-null  float64
 26  V26     284807 non-null  float64
 27  V27     284807 non-null  float64
 28  V28     284807 non-null  float64
 29  Amount  284807 non-null  float64
 30  Class   284807 non-null  int64  
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
None

Missing values: 0
Out[4]:
count mean std min 25% 50% 75% max
Time 284807.0 9.481386e+04 47488.145955 0.000000 54201.500000 84692.000000 139320.500000 172792.000000
V1 284807.0 1.759088e-12 1.958696 -56.407510 -0.920373 0.018109 1.315642 2.454930
V2 284807.0 -8.251210e-13 1.651309 -72.715728 -0.598550 0.065486 0.803724 22.057729
V3 284807.0 -9.655224e-13 1.516255 -48.325589 -0.890365 0.179846 1.027196 9.382558
V4 284807.0 8.321417e-13 1.415869 -5.683171 -0.848640 -0.019847 0.743341 16.875344
V5 284807.0 1.650335e-13 1.380247 -113.743307 -0.691597 -0.054336 0.611926 34.801666
V6 284807.0 4.248462e-13 1.332271 -26.160506 -0.768296 -0.274187 0.398565 73.301626
V7 284807.0 -3.054652e-13 1.237094 -43.557242 -0.554076 0.040103 0.570436 120.589494
V8 284807.0 8.777941e-14 1.194353 -73.216718 -0.208630 0.022358 0.327346 20.007208
V9 284807.0 -1.179734e-12 1.098632 -13.434066 -0.643098 -0.051429 0.597139 15.594995
In [5]:
# Target Variable Distribution - The dataset is highly imbalanced, which will influence our modelling strategy later.

fig = px.histogram(Credit, x='Class', color='Class',
                   color_discrete_map={0: "skyblue", 1: "red"},
                   title="Fraud (1) vs Non-Fraud (0)",
                   text_auto=True)

# Show percentage in annotation (optional)

fraud_ratio = Credit['Class'].value_counts(normalize=True)[1] * 100
fig.update_layout(
    annotations=[dict(
        x=0.5,
        y=1.05,
        xref='paper',
        yref='paper',
        text=f"Fraudulent transactions represent only {fraud_ratio:.3f}% of total data",
        showarrow=False,
        font=dict(size=14))])
fig.show(renderer="notebook_connected")
In [6]:
# Transaction Amount Distribution

plt.figure(figsize=(8,5))
sns.histplot(Credit['Amount'], bins=100, kde=True)
plt.title("Distribution of Transaction Amounts")
plt.xlabel("Transaction Amount")
plt.show()
In [7]:
# Temporal Analysis - The Time variable represents seconds elapsed since the first transaction. We create an Hour feature to see if fraud clusters occur at specific times.

Credit['Hour'] = ((Credit['Time'] // 3600) % 24).astype(int)

# Create an interactive histogram

fig = px.histogram(
    Credit, 
    x='Hour', 
    color='Class',
    barmode='group',  # side-by-side bars
    color_discrete_map={0: "skyblue", 1: "red"},
    title="Fraud Frequency by Hour of Day",
    labels={'Class': 'Transaction Class', 'Hour': 'Hour of Day'},
    text_auto=True)

fig.update_layout(
    xaxis=dict(dtick=1),  # show every hour tick
    yaxis_title="Count",
    legend_title="Class")

fig.show(renderer="notebook_connected")
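The `Hour` derivation above folds elapsed seconds onto a 24-hour clock. A quick sanity check of the arithmetic, with sample values spanning the dataset's 0 to 172792 second range:

```python
# Sanity check of the Time -> Hour conversion used above: integer-divide
# elapsed seconds by 3600, then wrap onto a 24-hour clock with modulo 24.
def to_hour(seconds):
    return int((seconds // 3600) % 24)

for t in (0.0, 3599.0, 3600.0, 90000.0, 172792.0):
    print(t, "->", to_hour(t))
```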
In [8]:
# Correlation of Amount with Fraud (Class), computed within each hour

hourly_fraud_corr = (
    Credit.groupby('Hour')[['Amount', 'Class']]
    .apply(lambda x: x['Amount'].corr(x['Class']))
    .reset_index(name='Correlation')
)

# Create a color list: red if correlation > 0, blue if < 0

colors = ['#FF6B6B' if val > 0 else '#4D96FF' for val in hourly_fraud_corr['Correlation']]

plt.figure(figsize=(10,5))
sns.barplot(x='Hour', y='Correlation', data=hourly_fraud_corr, palette=colors)

plt.title("Correlation of Amount with Fraud by Hour")
plt.ylabel("Correlation with Fraud (Class)")
plt.xlabel("Hour of Day")
plt.show()
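As a minimal illustration of the per-hour Pearson correlation computed above, here is a toy frame with hypothetical values (not dataset figures): a group where the largest amount is the fraudulent one yields a strong positive correlation, while a constant-amount group has zero variance and the correlation is undefined (NaN).

```python
import pandas as pd

# Toy frame (hypothetical values): hour 0 shows fraud on the largest amount,
# hour 1 has constant amounts, so its correlation is undefined (NaN).
demo = pd.DataFrame({
    "Hour":   [0, 0, 0, 1, 1, 1],
    "Amount": [10.0, 20.0, 30.0, 5.0, 5.0, 5.0],
    "Class":  [0, 0, 1, 0, 1, 0],
})
per_hour = demo.groupby("Hour")[["Amount", "Class"]].apply(
    lambda g: g["Amount"].corr(g["Class"])
)
print(per_hour)
```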
In [9]:
# Interactive Visualization (Plotly)

import plotly.express as px
import plotly.io as pio

pio.renderers.default = "notebook"  # or "notebook_connected" / "jupyterlab"

fig = px.histogram(Credit, x="Amount", color="Class", nbins=60,
                   barmode="overlay", title="Transaction Amount by Fraud Status",
                   color_discrete_map={0: "blue", 1: "red"})
fig.show(renderer="notebook_connected")

Key Insights

  • The dataset is highly imbalanced (~0.17% of transactions are fraudulent).
  • Fraud frequency varies by hour of day (see the correlation plot above).
  • Feature scaling and class balancing will be essential for accurate modelling.
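One common way to act on that imbalance later is class weighting. A minimal sketch of the "balanced" heuristic (weight = n_samples / (n_classes × n_in_class)), using a synthetic label vector at roughly the dataset's 0.17% fraud rate — illustrative numbers only, not dataset values:

```python
import numpy as np

# Synthetic labels at ~0.17% positives (assumption: illustrative only).
y = np.array([0] * 9983 + [1] * 17)

# "balanced" heuristic: weight_c = n_samples / (n_classes * n_c),
# so the rare fraud class receives a weight of several hundred.
n, classes = len(y), (0, 1)
weights = {c: n / (len(classes) * (y == c).sum()) for c in classes}
print(weights)
```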

Next steps

  • Data cleaning
  • Feature engineering
  • First ML models
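The cell below inspects a frame named `credit` whose construction is not shown in this notebook. A hedged reconstruction of its three engineered columns, consistent with the output that follows: the `is_night` window of 22:00–06:59 is an assumption inferred from the printed counts, and `amount_log` matches `np.log1p` on the sample rows.

```python
import numpy as np
import pandas as pd

def add_engineered_features(df):
    """Add the hour, is_night and amount_log columns (assumed definitions)."""
    out = df.copy()
    out["hour"] = (out["Time"] // 3600) % 24          # hour of day, as float
    # Assumption: "night" spans 22:00 through 06:59 (matches the counts below).
    out["is_night"] = ((out["hour"] >= 22) | (out["hour"] <= 6)).astype(int)
    out["amount_log"] = np.log1p(out["Amount"])       # log(1 + Amount)
    return out

# Demo rows: amounts taken from the head() output above,
# Times chosen to show both a night and a day transaction.
demo = pd.DataFrame({"Time": [0.0, 30000.0], "Amount": [149.62, 2.69]})
print(add_engineered_features(demo))
```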
In [12]:
# Check the engineered data set (credit): hour, is_night and amount_log added

print("CREDIT CARD DATA")
print("Shape:", credit.shape)
print(credit.head())
print("\nColumns:", credit.columns)

print("\n" + "="*60 + "\n")

print("Hour distribution:")
print(credit['hour'].value_counts().sort_index())

print("\nNight transactions count:")
print(credit['is_night'].value_counts())
CREDIT CARD DATA
Shape: (284807, 34)
   Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9  ...       V24       V25       V26       V27       V28  \
0  0.098698  0.363787  ...  0.066928  0.128539 -0.189115  0.133558 -0.021053   
1  0.085102 -0.255425  ... -0.339846  0.167170  0.125895 -0.008983  0.014724   
2  0.247676 -1.514654  ... -0.689281 -0.327642 -0.139097 -0.055353 -0.059752   
3  0.377436 -1.387024  ... -1.175575  0.647376 -0.221929  0.062723  0.061458   
4 -0.270533  0.817739  ...  0.141267 -0.206010  0.502292  0.219422  0.215153   

   Amount  Class  hour  is_night  amount_log  
0  149.62      0   0.0         1    5.014760  
1    2.69      0   0.0         1    1.305626  
2  378.66      0   0.0         1    5.939276  
3  123.50      0   0.0         1    4.824306  
4   69.99      0   0.0         1    4.262539  

[5 rows x 34 columns]

Columns: Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class', 'hour', 'is_night', 'amount_log'],
      dtype='object')

============================================================

Hour distribution:
hour
0.0      7695
1.0      4220
2.0      3328
3.0      3492
4.0      2209
5.0      2990
6.0      4101
7.0      7243
8.0     10276
9.0     15838
10.0    16598
11.0    16856
12.0    15420
13.0    15365
14.0    16570
15.0    16461
16.0    16453
17.0    16166
18.0    17039
19.0    15649
20.0    16756
21.0    17703
22.0    15441
23.0    10938
Name: count, dtype: int64

Night transactions count:
is_night
0    230393
1     54414
Name: count, dtype: int64

Are there more frauds during the night?

In [10]:
# Total number of fraudulent transactions

fraud_df = credit[credit["Class"] == 1]
print("Total fraudulent transactions:", len(fraud_df))

# At night?

fraud_night_counts = fraud_df["is_night"].value_counts()
print("Fraud count by night/day:")
print(fraud_night_counts)

fraud_night_percent = fraud_df["is_night"].value_counts(normalize=True) * 100
print("\nFraud percentage by night/day:")
print(fraud_night_percent)

# Concentrated at any particular hour?

fraud_by_hour = fraud_df.groupby("hour").size()
print(fraud_by_hour)
Total fraudulent transactions: 492
Fraud count by night/day:
is_night
0    329
1    163
Name: count, dtype: int64

Fraud percentage by night/day:
is_night
0    66.869919
1    33.130081
Name: proportion, dtype: float64
hour
0.0      6
1.0     10
2.0     57
3.0     17
4.0     23
5.0     11
6.0      9
7.0     23
8.0      9
9.0     16
10.0     8
11.0    53
12.0    17
13.0    17
14.0    23
15.0    26
16.0    22
17.0    29
18.0    33
19.0    19
20.0    18
21.0    16
22.0     9
23.0    21
dtype: int64
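Raw fraud counts per hour are confounded by overall traffic volume: hour 2 stands out partly because so few transactions occur then at all. Dividing the two series printed above gives a per-hour fraud *rate*; a sketch using three of the hours shown, with values copied from the outputs above:

```python
import pandas as pd

# Per-hour transaction totals and fraud counts, copied from the printouts above.
total = pd.Series({0: 7695, 1: 4220, 2: 3328})
fraud = pd.Series({0: 6, 1: 10, 2: 57})

# Fraud rate as a percentage of that hour's traffic.
rate_pct = (fraud / total * 100).round(3)
print(rate_pct)
```

On these three hours the rate at 02:00 is roughly twenty times the rate at midnight, a much sharper signal than the raw counts suggest.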